Towards End-to-End Speech Recognition

Authors

  • Dimitri PALAZ
Abstract

Standard automatic speech recognition (ASR) systems follow a divide-and-conquer approach to convert speech into text. That is, the end goal is achieved by a combination of sub-tasks, namely feature extraction, acoustic modeling and sequence decoding, which are optimized independently. More recently, deep learning approaches have emerged in the machine learning community that allow systems to be trained in an end-to-end manner. Such approaches have found success in natural language processing and computer vision, and have consequently piqued interest in the speech community. The present thesis builds on these recent advances to investigate approaches for developing speech recognition systems in an end-to-end manner. In that respect, the thesis follows two main axes of research. The first axis of research focuses on joint learning of features and classifiers for acoustic modeling. The second axis of research focuses on joint training of the acoustic model and the decoder, leading to an end-to-end sequence recognition system. Along the first axis of research, in the framework of hybrid hidden Markov model/artificial neural network (HMM/ANN) based ASR, we develop a convolutional neural network (CNN) based acoustic modeling approach that takes the raw speech signal as input and estimates phone class conditional probabilities. Specifically, the CNN has several convolution layers (feature stage) followed by a multilayer perceptron (classifier stage), which are jointly optimized during training. Through ASR studies on multiple languages and extensive analysis of the approach, we show that the proposed approach, with minimal prior knowledge, is able to automatically learn the relevant features from the raw speech signal. This approach yields systems that have fewer parameters and achieve better performance than the conventional approach of cepstral feature extraction followed by classifier training.
As the features are automatically learned from the signal, a natural question arises: are such systems robust to noise? To address it, we propose a robust CNN approach, referred to as the normalized CNN approach, which yields systems that are as robust as, or more robust than, conventional ASR systems using cepstral features (with feature-level normalizations). The second axis of research focuses on end-to-end sequence recognition. We first propose an end-to-end phoneme recognition system. In this system the relevant features, the classifier and the decoder (based on conditional random fields) are jointly modeled during training. We demonstrate the viability of the approach on the TIMIT phoneme recognition task. Building on that, we investigate a “weakly supervised” training that alleviates the need for frame-level alignments. Finally, we extend the weakly supervised approach to propose a novel keyword spotting technique. In this technique, a CNN first processes the input observation sequence ...
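
The first axis described in the abstract (convolution layers acting as a feature stage on the raw waveform, followed by a multilayer perceptron classifier that outputs phone class probabilities) can be illustrated with a minimal forward-pass sketch. Everything concrete here is an assumption made for illustration and is not taken from the thesis: the layer sizes, the ReLU nonlinearity, the max pooling, the helper names (`conv1d`, `make_params`, `phone_posteriors`) and the random, untrained weights.

```python
import math
import random


def conv1d(signal, kernels, stride):
    """Valid 1-D convolution over time.

    signal: list of input channels, each a list of floats.
    kernels[o][c]: filter taps linking input channel c to output channel o.
    """
    width = len(kernels[0][0])
    n_out = (len(signal[0]) - width) // stride + 1
    out = []
    for kernel in kernels:
        row = []
        for t in range(n_out):
            start = t * stride
            acc = 0.0
            for c, taps in enumerate(kernel):
                for k in range(width):
                    acc += taps[k] * signal[c][start + k]
            row.append(acc)
        out.append(row)
    return out


def max_pool_relu(channels, pool):
    """Non-overlapping max pooling over time, then ReLU (both assumed choices)."""
    return [[max(0.0, max(ch[i:i + pool]))
             for i in range(0, len(ch) - pool + 1, pool)]
            for ch in channels]


def dense(vec, weights, biases):
    """Fully connected layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, biases)]


def softmax(scores):
    """Numerically stable softmax, so the outputs form a probability vector."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def rand_matrix(rng, rows, cols, scale=0.1):
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def make_params(rng, n_samples, n_classes, n_hidden=16):
    """Random, untrained weights with hypothetical layer sizes."""
    # (input channels, output channels, kernel width, stride, pooling size)
    layers = [(1, 4, 30, 10, 2), (4, 8, 5, 1, 2)]
    conv, length = [], n_samples
    for in_ch, out_ch, width, stride, pool in layers:
        kernels = [rand_matrix(rng, in_ch, width) for _ in range(out_ch)]
        conv.append((kernels, stride, pool))
        length = ((length - width) // stride + 1) // pool
    n_flat = layers[-1][1] * length
    return {
        "conv": conv,
        "w1": rand_matrix(rng, n_hidden, n_flat), "b1": [0.0] * n_hidden,
        "w2": rand_matrix(rng, n_classes, n_hidden), "b2": [0.0] * n_classes,
    }


def phone_posteriors(waveform, params):
    """Feature stage (convolutions) followed by classifier stage (MLP)."""
    x = [waveform]  # a single input channel holding the raw samples
    for kernels, stride, pool in params["conv"]:
        x = max_pool_relu(conv1d(x, kernels, stride), pool)
    flat = [v for ch in x for v in ch]
    hidden = [max(0.0, h) for h in dense(flat, params["w1"], params["b1"])]
    return softmax(dense(hidden, params["w2"], params["b2"]))


rng = random.Random(0)
# A 25 ms window of a synthetic 440 Hz tone at 16 kHz stands in for real speech.
waveform = [math.sin(2 * math.pi * 440 * n / 16000) for n in range(400)]
params = make_params(rng, len(waveform), n_classes=5)
posteriors = phone_posteriors(waveform, params)
```

In a trained system both stages would be optimized jointly, as the abstract describes; this sketch only shows how the raw samples flow through the two stages to a probability vector over phone classes.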


Similar resources

Low-power speech processing based upon floating-gate circuits - The Thirty-Seventh Asilomar Conference on Signals, Systems & Computers, 2003

Abstract: This paper describes our current efforts towards creating cooperative analog/digital signal processing (CADSP) systems for auditory sensor and signal processing applications. We address resolution issues that affect the choice of signal processing algorithms arriving from an analog sensor. We discuss current analog circuit approaches towards the front-end signal processing by reviewing ...


Towards End-to-End Speech Recognition with Deep Convolutional Neural Networks

Convolutional Neural Networks (CNNs) are effective models for reducing spectral variations and modeling spectral correlations in acoustic features for automatic speech recognition (ASR). Hybrid speech recognition systems incorporating CNNs with Hidden Markov Models/Gaussian Mixture Models (HMMs/GMMs) have achieved the state-of-the-art in various benchmarks. Meanwhile, Connectionist Temporal Cla...


Towards end-to-end speech recognition for Chinese Mandarin using long short-term memory recurrent neural networks

End-to-end speech recognition systems have been successfully designed for English. Taking into account the distinctive characteristics of Chinese Mandarin and English, it is worthwhile to do some additional work to transfer these approaches to Chinese. In this paper, we attempt to build a Chinese speech recognition system using an end-to-end learning method. The system is based on a combination o...


Towards Language-Universal End-to-End Speech Recognition

Building speech recognizers in multiple languages typically involves replicating a monolingual training recipe for each language, or utilizing a multi-task learning approach where models for different languages have separate output labels but share some internal parameters. In this work, we exploit recent progress in end-to-end speech recognition to create a single multilingual speech recogniti...


SpeakQL: Towards Speech-driven Multi-modal Querying

Natural language and touch-based interfaces are making data querying significantly easier. But typed SQL remains the gold standard for query sophistication although it is painful in many querying environments. Recent advancements in automatic speech recognition raise the tantalizing possibility of bridging this gap by enabling spoken SQL queries. In this work, we outline our vision of one such n...




Publication date: 2016